Ensuring veracity of input corpus for effective artificial intelligence applications: veracity index

ABSTRACT

Artificially intelligent systems draw inferences and conclusions by analyzing information in natural language and then using that information to prove or disprove hypotheses. The quality of such inferences depends directly on the accuracy of the input data corpus. Given the proliferation of the Internet and of dubious data sources on social media, it is important to determine the truthfulness of the input information. Combining concepts of library classification, crowd-sourced curation and Google Scholar search, we propose the concept of the Veracity Index and an algorithm to calculate it. This index can be used in Artificial Intelligence systems to determine the confidence measurement of the inferences.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The invention generally relates to artificial intelligence and, more specifically, to Ensuring Veracity of Input Corpus for Effective Artificial Intelligence Applications: Veracity Index.

2. Description of the Prior Art

Human life has perpetually been influenced by the evolution of technology. We have continuously developed products that improve our way of life. Recently, Artificial Intelligence (AI) systems with engagement capability have completely changed the way humans and systems interact (IBM and Softbank, Fortune.com). Studies have estimated that the incremental productivity value of automating knowledge work could exceed $5.2 trillion by 2025 (HorizonWatch 2015 Trend Report).

AI systems can learn and reason (IBM Research, Research.ibm.com), while comprehending data using a natural language, like English, rather than a computer-based language, such as C++ (The Mind Project, Mind.ILSTU.edu). Furthermore, Artificial Intelligence demonstrates great promise in aiding doctors, bankers, engineers, and other professionals to arrive at accurate decisions quickly (IBM Content, TheAtlantic.com). For instance, at Memorial Sloan Kettering Cancer Center in New York, USA, IBM's Watson has been ‘training’ for over a year to develop an effective tool in order to help medical professionals customize treatment plans for cancer patients (Watson Oncology, MSKCC.org).

An important aspect of Artificial Intelligence is, undeniably, machine learning. A computer first analyzes large amounts of data and information from numerous sources. Subsequently, the data is synthesized into insights, which are then presented. The sources of this information could be newspaper articles, reputable peer-reviewed journals, and most importantly, the Internet. A major inhibitor to the entire machine learning process is that many Internet articles are not, in fact, accurate; inferences based on incorrect data taint the quality of the output.

SUMMARY OF THE INVENTION

To mitigate this challenge, we propose an approach that combines well-established classification methods such as the Dewey Decimal Classification with crowd-sourced curation approaches such as “Wikipedia” to generate a measure of authenticity. We call this the “Veracity Index.” This index will be used by machine learning algorithms to decide whether to include a source in, or exclude it from, the data corpus. We also propose an algorithm to calculate the confidence of recommendations based on the veracity of the input data.

BRIEF DESCRIPTION OF THE FIGURES

The above and other aspects, features and advantages of the present invention will be more apparent from the following description when taken in conjunction with the accompanying drawings, in which:

FIG. 1 illustrates the flow chart corresponding to the veracity algorithm;

FIG. 2 illustrates the flow chart for the calculation of confidence score based on input veracity; and

FIG. 3 is a system block diagram for implementing the invention illustrated in FIGS. 1 and 2.

DETAILED DESCRIPTION

Overview of the Veracity Application Algorithm

The authors of articles submit their articles for evaluation, after classifying the subject area. Based on the subject area, the application identifies the appropriate experts from the expert pool and triggers the workflow to them to assess the article. The expert pool assesses the article. Based on the assessment of the individual experts, the application calculates the veracity index of the article and also updates the eminence index of the author for this subject area.

Details of the Algorithmic Elements

Step A1. Article Submission

The author submits the article for veracity index assessment. The author determines the subject area and self-classifies the article using the Dewey Decimal Classification. Dewey Decimal Classification is ideal as it is a hierarchical approach that organizes content into ten main classes, each divided into ten divisions, each of which is in turn divided into ten sections. It also incorporates faceted classification, which allows multiple subject areas to be captured in the classification (The Editors of Encyclopaedia Britannica, Encyclopedia Britannica).

Step A2. Subject Verification

The algorithm verifies the subject area proposed by the authors. The subject area is determined independently using advanced artificial intelligence techniques such as natural language processing and unstructured data analysis. If the system generates a different subject area than the one proposed by the authors, a workflow is triggered to the authors to verify their classification and resubmit.

Step A3. Eminent Pool Determination

The algorithm, based on the subject classification, identifies the eminent experts. Workflow is triggered to five eminent experts. The workflow assignment distributes the workload by considering the current and historical requests made to these eminent experts when choosing to whom the workflow for Step A7 is directed.

If there are no eminent experts in the subject area, the algorithm uses Step A4 to find an alternative.
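The selection in Step A3 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Expert` record, its field names, and the ordering rule (current load first, historical load as a tie-breaker) are assumptions, since the text only says that current and historical requests are considered.

```python
from dataclasses import dataclass


@dataclass
class Expert:
    # Hypothetical record; the field names are illustrative assumptions.
    name: str
    subject: str        # Dewey Decimal class in which the expert is eminent
    open_requests: int  # assessments currently assigned
    past_requests: int  # assessments handled historically


def select_experts(pool, subject, pool_size=5):
    """Pick up to `pool_size` experts for `subject`, least-loaded first.

    An empty result signals that Step A4 (adjacent experts) is needed.
    """
    candidates = [e for e in pool if e.subject == subject]
    # Assumed ordering: current load, then historical load, then name.
    candidates.sort(key=lambda e: (e.open_requests, e.past_requests, e.name))
    return candidates[:pool_size]
```

Sorting on a tuple keeps the rule easy to change; a different balance between current and historical load would only alter the sort key.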

Step A4. Identification of Adjacent Experts

If no eminent experts are identified in the subject area of classification, the algorithm looks for experts in adjacent fields. Dewey Decimal Classification aids in this task, as adjacency is determined either by the parent classification in the hierarchy or by the subject areas that went into the faceted classification. The next step is Step A5.
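The hierarchical half of this adjacency rule can be sketched as a walk up the Dewey number, for example from 516 (geometry) to 510 (mathematics) to 500 (science). The function names and the `experts_by_class` mapping are hypothetical, and faceted adjacency is not covered by this sketch.

```python
def parent_classes(code):
    """Return broader Dewey classes for `code`, nearest parent first."""
    whole, _, fraction = code.partition(".")
    whole = whole.zfill(3)  # normalise "6" to "006"
    parents = []
    # Drop decimal digits one at a time: 006.35 -> 006.3
    for cut in range(len(fraction) - 1, 0, -1):
        parents.append(f"{whole}.{fraction[:cut]}")
    if fraction:
        parents.append(whole)
    # Section -> division -> main class: 516 -> 510 -> 500
    for broader in (whole[:2] + "0", whole[0] + "00"):
        if broader != whole and broader not in parents:
            parents.append(broader)
    return parents


def find_adjacent_experts(experts_by_class, code):
    """Return the experts of the nearest parent class that has any."""
    for parent in parent_classes(code):
        if experts_by_class.get(parent):
            return experts_by_class[parent]
    return []
```

Walking from the nearest parent outward means the adjacent experts polled in Step A5 are as close to the article's subject as the pool allows.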

Step A5. Adjacent Expert Recommendation

Up to six adjacent experts are polled for their recommendation of the eminent experts in the required subject field, by forwarding them the original article. These adjacent experts recommend multiple authors and propose eminence indices for them. These recommendations are incorporated in Step A6.

Step A6. Assign Eminence Index

Aggregate the recommendations of the adjacent experts to calculate the Eminence Index of the authors for the corresponding subject.
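The text does not say how the recommendations are aggregated. The sketch below assumes a simple arithmetic mean of the proposed indices for each recommended author; the function name and the tuple layout are illustrative only.

```python
from collections import defaultdict
from statistics import mean


def aggregate_recommendations(recommendations):
    """Average the proposed eminence indices per recommended author.

    `recommendations` is an iterable of
    (adjacent_expert, recommended_author, proposed_index) tuples.
    The mean is an assumed aggregation rule, not one stated in the text.
    """
    proposals = defaultdict(list)
    for _adjacent_expert, author, proposed_index in recommendations:
        proposals[author].append(proposed_index)
    return {author: mean(values) for author, values in proposals.items()}
```

A median or a trimmed mean would fit the same interface if a few extreme proposals should carry less weight.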

Step A7. Assess Veracity

The experts assess the veracity of the article using the Veracity Index on a Likert Scale of 1-10.

Step A8. Calculate the Veracity Index and Author's Eminence Index

The Veracity Index of the article is averaged across the expert pool. While the expert pool is small in the beginning, as with crowd-sourced approaches, it is enriched as the index gains traction. Once there is a critical mass of experts, the Veracity Index can be further refined by discarding the outliers in the veracity assessment, thus further reducing subjectivity. Experts in the field also have the option of voluntarily assessing the veracity of articles. Whenever there is an individual assessment of the veracity of the article for a specific subject, the article veracity is recalculated to incorporate the new assessment. In addition to the Veracity Index of the article for each subject, the count of expert assessors, their average Eminence Index, and the weighted average Veracity Index (weighted average of veracity assessment × eminence index) are also tracked. This full vector of Veracity Index, Weighted Veracity Index, Count of Expert Assessors, and Expert Average Eminence Index paints a full picture of the quality of the article. Articles with a veracity index are maintained in a database at A9 and authors with an eminence index are maintained at A10.
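The four-part vector described above can be sketched as follows. Two points are assumptions rather than statements from the text: the weighted index is read as the sum of rating times eminence divided by the sum of eminence, and outliers are discarded by trimming a fixed number of the lowest and highest ratings. The names `VeracitySummary` and `summarize` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class VeracitySummary:
    veracity_index: float
    weighted_veracity_index: float
    assessor_count: int
    average_eminence: float


def summarize(assessments, trim=0):
    """Summarise (rating, assessor_eminence) pairs for one article and subject.

    `trim` drops that many of the lowest and highest ratings, an assumed
    outlier rule that is applied only when enough ratings remain.
    Eminence values are assumed to be positive.
    """
    if not assessments:
        raise ValueError("at least one assessment is required")
    ranked = sorted(assessments, key=lambda a: a[0])
    if trim > 0 and len(ranked) > 2 * trim:
        ranked = ranked[trim:-trim]
    ratings = [rating for rating, _ in ranked]
    eminences = [eminence for _, eminence in ranked]
    weighted = sum(r * e for r, e in ranked) / sum(eminences)
    return VeracitySummary(mean(ratings), weighted, len(ranked), mean(eminences))
```

Keeping the count and the average eminence next to the two indices lets a consumer distinguish a high score from three little-known assessors from the same score given by thirty eminent ones.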

After the Veracity Index of the article is computed, it is used to further recalculate the author's Eminence Index.

These calculations are algebraically described in the following section.

Eminence and Veracity Index Calculation

Notations

-   i: index for author
-   j: index for subject
-   k: index for the article
-   l: index for expert
-   I: max count for authors
-   J: max count for subjects
-   K: max count for the articles
-   L: max count for experts
-   $S_{jk}$: j-th subject for article k
-   $V_{kj}(l)$: veracity for article k in subject j rated by expert l
-   $V_{kj}$: veracity for article k in subject j
-   $E_{ij}$: eminence for author i in subject j
-   $A_{kl}$: l-th author for article k

Computations

Veracity Index for article k in subject j

For each k,

For each j,

For each l,

$V_{kj} = \frac{{V_{kj}*{{Count}\left( V_{kj} \right)}} + {V_{kj}(l)}}{{{Count}\left( V_{kj} \right)} + 1}$

Here Count(V_(kj)) denotes the number of expert ratings already included in V_(kj), so each new rating V_(kj)(l) updates the running average.
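Reading the update as a running average, in line with the statement that the Veracity Index is averaged across the expert pool, each new rating can be folded in without storing earlier ratings. The function name and the sample ratings below are hypothetical.

```python
def update_veracity(current, count, rating):
    """Fold one expert rating into the running mean.

    current: veracity index before the update
    count:   number of ratings already included in `current`
    rating:  the new expert rating on the 1 to 10 scale
    """
    return (current * count + rating) / (count + 1)


# Three hypothetical ratings arriving one at a time.
veracity, count = 0.0, 0
for rating in [8, 6, 10]:
    veracity = update_veracity(veracity, count, rating)
    count += 1
# veracity is now 8.0, the plain mean of 8, 6 and 10
```

Only the current index and the count need to be stored per article and subject, which suits voluntary assessments arriving at arbitrary times.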

Eminence Index for Author i in Subject j

For each i,

For each j

$E_{ij} = \frac{{E_{ij}*{{Count}\left( E_{ij} \right)}} + V_{kj}}{{{Count}\left( E_{ij} \right)} + 1}$
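The same incremental form applies to the author's Eminence Index, kept per author and subject. The sketch below stores the index together with its count in a dictionary; the storage layout and the names are assumptions for illustration.

```python
def update_eminence(table, author, subject, article_veracity):
    """Fold an article's veracity index into the author's eminence.

    `table` maps (author, subject) to a (eminence, count) pair, where
    count is the number of article indices already averaged in.
    """
    eminence, count = table.get((author, subject), (0.0, 0))
    eminence = (eminence * count + article_veracity) / (count + 1)
    table[(author, subject)] = (eminence, count + 1)
    return eminence
```

Keying on the pair (author, subject) reflects that eminence is tracked separately for each subject area, so a strong record in one field does not carry over to another.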

Application to Artificial Intelligence Confidence Score Calculation

Artificial intelligence systems formulate hypotheses and test these hypotheses based on the inputs available. For example, in the medical field, the systems examine the list of symptoms and match each of the symptoms to corresponding hypotheses of diagnoses. Based on the number of symptoms that match the diagnoses, they determine the confidence value of the diagnoses. The AI systems are also able to keep track of the sources that provide the symptom-diagnosis mapping. Once the veracity indices of the sources A9-1 to A9-n are available, the confidence calculations can be revised to be weighted by the veracity index of the source article. The documents with veracity indices A9-1 to A9-n are used to formulate and test hypotheses at A11, and an aggregate veracity index of the pool is established at A12. The confidence score is then updated by the weighted average of the veracity index. In addition, the AI systems can be programmed to include in the input corpus only articles with a veracity index higher than a certain threshold. The approach is shown in FIG. 2.
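One way to picture the veracity-weighted confidence is to let each piece of evidence count in proportion to the veracity index of its source, after dropping sources below a threshold. The exact formula is not given in the text, so the ratio used below, the function name, and the sample values are assumptions.

```python
def weighted_confidence(evidence, min_veracity=0.0):
    """Veracity-weighted share of evidence that supports a hypothesis.

    `evidence` is a list of (supports_hypothesis, source_veracity) pairs.
    Sources with a veracity index below `min_veracity` are excluded.
    Returns None when no usable evidence remains.
    """
    kept = [(supports, v) for supports, v in evidence if v >= min_veracity]
    total = sum(v for _, v in kept)
    if total == 0:
        return None
    return sum(v for supports, v in kept if supports) / total


# Two well-rated sources support the diagnosis, one poorly rated source does not.
evidence = [(True, 9), (True, 8), (False, 3)]
confidence = weighted_confidence(evidence)                  # 17 / 20
filtered = weighted_confidence(evidence, min_veracity=5)    # 1.0
```

With equal weights the same evidence would give two out of three; weighting by veracity shifts the score toward the better-rated sources.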

Referring to FIG. 3, the block diagram of the system for implementing the invention shown in FIGS. 1 and 2 is generally designated by the reference numeral 10.

The system 10 includes a computer or CPU 12 that has an input 14 for inputting data on a predetermined subject, such as documents, articles, etc., the veracity of which is to be authenticated or verified. A database 16, which can communicate with the CPU 12, is provided for storing eminent expert information on a predetermined subject. A further database 18, which is in bi-directional communication with the CPU 12, is provided for storing information on adjacent experts on the predetermined subject. A further database 20, which can also be accessed by the CPU 12, contains information about data, documents and articles the veracity of which has been determined by the expert pool. A database 24 containing data about authors with an eminence index is also in bi-directional communication with the CPU 12, as is a database 26 that contains data with regard to articles/documents, each of which is assigned a veracity index.

In operation, data or other documents are submitted for veracity determination at input 14 to the CPU 12. The CPU determines whether the document has a subject that is verified. The CPU then determines whether the identified subject can be associated with an eminent expert pool by accessing the eminent expert database 18. If no eminent experts are available in the database 18, the CPU accesses the adjacent expert database 16 to determine or suggest adjacent experts who can, in turn, identify potential eminent experts in the predetermined subject. The CPU 12 assigns an eminence index to the experts in the subject area for selection of experts that can assess veracity. The eminent experts assign a veracity index to the data, article, etc. that is stored in the veracity database 20, and articles that have a veracity index are stored in the database 22 while authors with an eminence index are stored in database 24. The documents with a veracity index are used to compute an aggregate veracity index that, in turn, establishes a confidence score by weighted average of the veracity index. The CPU 12, therefore, with the document input and the databases described, can carry out the method shown in the flowcharts of FIGS. 1 and 2.

An incentive arrangement as well as a scheduling method equally distributes the workload of the expert pool.

Artificial Intelligence is clearly a major component of technology in the future. However, Artificial Intelligence has the same Achilles heel as any other computing system: the “Garbage In, Garbage Out” paradigm. The quality of the recommendations is heavily dependent on the quality of the input data. Our proposed approach can correct this critical flaw, improving the efficacy of Artificial Intelligence to enrich our lives.

The foregoing is considered as illustrative only of the principles of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation shown and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the invention.

REFERENCES

Book

The Editors of Encyclopaedia Britannica. “Dewey Decimal Classification|Library Science.” Encyclopedia Britannica. Encyclopedia Britannica, n.d. Web. 13 Jan. 2016.

Magazine/Newspaper

-   IBM Contributor, IBM. “What's The Future Of Artificial Intelligence? IBM Watson.” Forbes. Forbes Magazine, 23 Feb. 2015. Web. 13 Jan. 2016.
-   IBM Content, Sponsored. “Watson Takes The Stand.” The Atlantic. Atlantic Media Company, 2 Mar. 2015. Web. 13 Jan. 2016.
-   “This Robot Could Soon Be in Classrooms.” Fortune. IBM and Softbank, 7 Jan. 2016. Web. 13 Jan. 2016.

Non-Print/Internet Sources

-   “Artificial Intelligence - HorizonWatch 2015 Trend Report.” Artificial Intelligence - HorizonWatch 2015 Trend Report. IBM, 27 Jan. 2015. Web. 13 Jan. 2016.
-   Baer, Drake. “‘Machine Learning’ Is A Revolution As Big As The Internet Or Personal Computers.” Business Insider. N. p., 2017. Web. 5 Aug. 2017.
-   “Elon Musk Thinks Governments Should Study Artificial Intelligence.” Fortune.com. N. p., 2017. Web. 5 Aug. 2017.
-   Lee, Kai-Fu. “Opinion | The Real Threat Of Artificial Intelligence.” Nytimes.com. N. p., 2017. Web. 5 Aug. 2017.
-   “IBM Research: Why Cognitive Systems?” IBM Research: Why Cognitive Systems? IBM Research, n.d. Web. 13 Jan. 2016.
-   “Introduction to Natural Language Processing.” The Mind Project. Consortium on Cognitive Science Instruction, n.d. Web. 13 Jan. 2016.
-   “Watson Oncology.” Watson Oncology. Memorial Sloan Kettering Cancer Center, n.d. Web. 13 Jan. 2016.

What is claimed:
 1. Method of ensuring veracity of input data by an author of the input data, comprising the steps of: establishing the subject matter of the input data; if the subject matter of the input data is verified, establishing whether there exists an eminent expert pool expert in the identified subject matter; if an expert pool for the identified subject matter exists, assessing the veracity of the input data by said expert pool and updating an input data veracity index and an author eminence index based on the expert pool assessment; establishing an aggregate veracity index of the pool of input data with veracity indices; and updating a confidence score by weighing the average of the veracity indices to enhance the veracity of the input data.
 2. Method of ensuring veracity of input data as defined in claim 1, wherein, in the absence of an eminent expert pool, adjacent experts are identified for suggesting eminent experts.
 3. Method of ensuring veracity of input data as defined in claim 2, further comprising the step of assigning an eminence index to eminent experts identified by said adjacent experts.
 4. Method of ensuring veracity of input data as defined in claim 1, wherein the veracity index for input data k in subject j is established as follows: $V_{kj} = \frac{{V_{kj}*{{Count}\left( V_{kj} \right)}} + {V_{kj}(l)}}{{{Count}\left( V_{kj} \right)} + 1}$ where i is the index for author; j is the index for subject; k is the index for the article; l is the index for expert; I is the max count for authors; J is the max count for subjects; K is the max count for the articles; L is the max count for experts; S_(jk) is the j-th subject for article k; V_(kj)(l) is the veracity for article k in subject j rated by expert l; V_(kj) is the veracity for article k in subject j; E_(ij) is the eminence for author i in subject j; and A_(kl) is the l-th author for article k.
 5. Method of ensuring veracity of input data as defined in claim 1, wherein the eminence index for author i in subject j is established as follows: $E_{ij} = \frac{{E_{ij}*{{Count}\left( E_{ij} \right)}} + V_{kj}}{{{Count}\left( E_{ij} \right)} + 1}$ where i is the index for author; j is the index for subject; k is the index for the article; l is the index for expert; I is the max count for authors; J is the max count for subjects; K is the max count for the articles; L is the max count for experts; S_(jk) is the j-th subject for article k; V_(kj)(l) is the veracity for article k in subject j rated by expert l; V_(kj) is the veracity for article k in subject j; E_(ij) is the eminence for author i in subject j; and A_(kl) is the l-th author for article k.
 6. Method of ensuring veracity of input data as defined in claim 1, wherein said input data comprises a document.
 7. Method of ensuring veracity of input data as defined in claim 6, wherein said document comprises an article.
 8. A system for ensuring veracity of input data by an author on a predetermined subject, comprising: a computer with an input for entering the input data; means for establishing the subject matter of the input data; means for establishing whether there exists an eminent expert pool expert in the identified subject matter; if an expert pool for the identified subject matter exists, assessing the veracity of the input data by an expert pool; means for updating an input data veracity index and an author eminence index based on the expert pool assessment; means for establishing an aggregate veracity index of the pool of input data with veracity indices; and means for updating a confidence score by weighing the average of the veracity indices to enhance the veracity of the input data.
 9. System for ensuring veracity of input data as defined in claim 8, wherein said means for establishing the existence of an eminent expert pool comprises an eminent expert database accessible by said computer.
 10. System for ensuring veracity of input data as defined in claim 8, wherein said means for updating said input data veracity index comprises a veracity database accessible by said computer.
 11. System for ensuring veracity of input data as defined in claim 8, wherein said means for updating said author eminence index comprises a database of authors with eminence index accessible to said computer.
 12. System for ensuring veracity of input data as defined in claim 8, wherein said computer is programmed to establish the veracity index for input data k in subject j by the following algorithm: $V_{kj} = \frac{{V_{kj}*{{Count}\left( V_{kj} \right)}} + {V_{kj}(l)}}{{{Count}\left( V_{kj} \right)} + 1}$ where i is the index for author; j is the index for subject; k is the index for the article; l is the index for expert; I is the max count for authors; J is the max count for subjects; K is the max count for the articles; L is the max count for experts; S_(jk) is the j-th subject for article k; V_(kj)(l) is the veracity for article k in subject j rated by expert l; V_(kj) is the veracity for article k in subject j; E_(ij) is the eminence for author i in subject j; and A_(kl) is the l-th author for article k.
 13. System for ensuring veracity of input data as defined in claim 8, wherein said computer is programmed to establish the eminence index for author i in subject j by the following algorithm: $E_{ij} = \frac{{E_{ij}*{{Count}\left( E_{ij} \right)}} + V_{kj}}{{{Count}\left( E_{ij} \right)} + 1}$ where i is the index for author; j is the index for subject; k is the index for the article; l is the index for expert; I is the max count for authors; J is the max count for subjects; K is the max count for the articles; L is the max count for experts; S_(jk) is the j-th subject for article k; V_(kj)(l) is the veracity for article k in subject j rated by expert l; V_(kj) is the veracity for article k in subject j; E_(ij) is the eminence for author i in subject j; and A_(kl) is the l-th author for article k.
 14. System for ensuring veracity of input data as defined in claim 8, further comprising an adjacent expert database accessible to said computer to identify an adjacent expert, in the absence of an eminent expert pool, for suggesting eminent experts.
 15. System for ensuring veracity of input data as defined in claim 8, further comprising a database accessible to said computer for maintaining articles or documents that have been assigned veracity indices.
 16. System for ensuring veracity of input data as defined in claim 8, further comprising a knowledge database.